-
Notifications
You must be signed in to change notification settings - Fork 958
Perf: Optimize comparison kernels for inlined views #7731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The performance benchmark result, the lt scalar StringViewArray shows more than 30% performance faster, and no obvious regression for all string view testing. lt scalar StringViewArray 1.00 20.9±1.35ms ? ?/sec 1.31 27.4±1.23ms ? ?/sec critcmp main fast_path_view --filter "iew"
group fast_path_view main
----- -------------- ----
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar complex 1.01 1257.0±18.64µs ? ?/sec 1.00 1249.7±10.10µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar contains 1.00 1235.9±18.20µs ? ?/sec 1.00 1230.0±16.30µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar ends with 1.03 697.8±32.61µs ? ?/sec 1.00 677.9±8.49µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar starts with 1.01 613.6±9.26µs ? ?/sec 1.00 605.7±11.02µs ? ?/sec
eq StringViewArray StringViewArray 1.00 4.2±0.05ms ? ?/sec 1.06 4.4±0.10ms ? ?/sec
eq long same prefix strings StringViewArray 1.01 510.0±13.43µs ? ?/sec 1.00 503.6±7.12µs ? ?/sec
eq scalar StringViewArray 13 bytes 1.00 4.0±0.03ms ? ?/sec 1.01 4.0±0.04ms ? ?/sec
eq scalar StringViewArray 4 bytes 1.00 4.0±0.04ms ? ?/sec 1.05 4.2±0.04ms ? ?/sec
eq scalar StringViewArray 6 bytes 1.00 4.0±0.04ms ? ?/sec 1.05 4.2±0.06ms ? ?/sec
like_utf8view scalar complex 1.00 82.6±0.54ms ? ?/sec 1.00 82.8±0.89ms ? ?/sec
like_utf8view scalar contains 1.00 76.7±0.56ms ? ?/sec 1.01 77.2±0.63ms ? ?/sec
like_utf8view scalar ends with 13 bytes 1.00 19.0±0.18ms ? ?/sec 1.00 19.0±0.30ms ? ?/sec
like_utf8view scalar ends with 4 bytes 1.00 19.1±0.22ms ? ?/sec 1.01 19.3±0.19ms ? ?/sec
like_utf8view scalar ends with 6 bytes 1.01 19.2±0.27ms ? ?/sec 1.00 19.1±0.27ms ? ?/sec
like_utf8view scalar equals 1.00 14.1±0.13ms ? ?/sec 1.00 14.1±0.13ms ? ?/sec
like_utf8view scalar starts with 13 bytes 1.01 18.6±0.58ms ? ?/sec 1.00 18.5±0.61ms ? ?/sec
like_utf8view scalar starts with 4 bytes 1.00 10.8±0.15ms ? ?/sec 1.01 11.0±0.22ms ? ?/sec
like_utf8view scalar starts with 6 bytes 1.00 18.7±0.34ms ? ?/sec 1.00 18.8±0.41ms ? ?/sec
long same prefix strings like_utf8view scalar complex 1.00 619.0±3.76µs ? ?/sec 1.00 616.8±3.57µs ? ?/sec
long same prefix strings like_utf8view scalar contains 1.02 1857.3±74.05µs ? ?/sec 1.00 1815.2±22.35µs ? ?/sec
long same prefix strings like_utf8view scalar ends with 1.00 609.7±7.74µs ? ?/sec 1.00 611.9±6.24µs ? ?/sec
long same prefix strings like_utf8view scalar equals 1.01 195.6±3.62µs ? ?/sec 1.00 193.5±3.08µs ? ?/sec
long same prefix strings like_utf8view scalar starts with 1.04 695.1±82.83µs ? ?/sec 1.00 671.2±6.65µs ? ?/sec
lt long same prefix strings StringViewArray 1.00 491.0±11.87µs ? ?/sec 1.00 489.8±5.67µs ? ?/sec
lt scalar StringViewArray 1.00 20.9±1.35ms ? ?/sec 1.31 27.4±1.23ms ? ?/sec
neq long same prefix strings StringViewArray 1.04 516.6±19.16µs ? ?/sec 1.00 499.0±6.16µs ? ?/sec |
Hi @Dandandan Based on what we’re seeing, it looks like directly comparing the raw u128 views can violate true lexicographic ordering under little‑endian layout—fuzz has caught cases where padding and byte‑order make u128.cmp() pick the wrong winner. |
I think we have to make sure only to compare the inlined string, not the entire u128/view value! |
I think for it to work we probably should:
|
Thank you @Dandandan , addressed in latest PR, and add benchmark, the result is amazing, 3.x faster and 10x faster. critcmp main fast_path_view --filter "inlined bytes"
group fast_path_view main
----- -------------- ----
eq StringViewArray StringViewArray inlined bytes 1.00 3.7±0.09ms ? ?/sec 3.33 12.3±0.15ms ? ?/sec
lt StringViewArray StringViewArray inlined bytes 1.00 5.2±0.13ms ? ?/sec 9.80 51.3±1.76ms ? ?/sec |
🤖 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks good to me -- thank you @zhuqi-lucas
I am rerunning the benchmarks to double check performance but it looks great
arrow-ord/src/cmp.rs
Outdated
let r_bytes = unsafe { GenericByteViewArray::<T>::inline_value(r_view, 12) }; | ||
return l_bytes.cmp(r_bytes).is_eq(); | ||
} | ||
|
||
// # Safety | ||
// The index is within bounds as it is checked in value() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it would be good to move the Safety
comment up along with the calls to unsafe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @alamb for review, addressed it in latest PR.
🤖: Benchmark completed Details
|
@@ -566,12 +566,18 @@ impl<'a, T: ByteViewType> ArrayOrd for &'a GenericByteViewArray<T> { | |||
type Item = (&'a GenericByteViewArray<T>, usize); | |||
|
|||
fn is_eq(l: Self::Item, r: Self::Item) -> bool { | |||
let l_view = unsafe { l.0.views().get_unchecked(l.1) }; | |||
let r_view = unsafe { r.0.views().get_unchecked(r.1) }; | |||
if l.0.data_buffers().is_empty() && r.0.data_buffers().is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code probably depends on inlining / moving this check outside of the loop?
Should we add #[inline(always)]
to it to make sure it keeps doing that? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Dandandan for this good suggestion, i addressed it in latest PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add the #[inline(always)] to eq(), lt() and compare_byte_view(), i am sure if it's the right way.
My results match the 3x / 10x improvement 🚀
|
🤖 |
arrow-ord/src/cmp.rs
Outdated
if left.data_buffers().is_empty() && right.data_buffers().is_empty() { | ||
// Only need to compare the inlined bytes | ||
let l_bytes = unsafe { | ||
GenericByteViewArray::<T>::inline_value(left.views().get_unchecked(left_idx), 12) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think this might not be correct for > / < in case of zeros in the values (e.g. https://en.wikipedia.org/wiki/Null_character).
I think we should add the length to the comparison (or do something smarter? I am not sure if that's possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns greater (instead of Equal):
> println!("{:?}", "\0".cmp(""))
> Greater
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @Dandandan , let me find a good way to handle it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Promising results!
I think we need some test to show it is correct with null characters and different lengths, I think we need to modify the implementation a bit.
🤖: Benchmark completed Details
|
Thank you @Dandandan , updated the code, and we still can gain some performance improvement for latest code. |
@Dandandan @alamb critcmp main fast_path_view --filter "iew"
group fast_path_view main
----- -------------- ----
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar complex 1.01 1252.0±14.91µs ? ?/sec 1.00 1242.8±12.91µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar contains 1.01 1224.7±7.64µs ? ?/sec 1.00 1215.0±22.97µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar ends with 1.01 673.7±3.22µs ? ?/sec 1.00 670.3±2.76µs ? ?/sec
StringViewArray: regexp_matches_utf8view scalar benchmarks/regexp_matches_utf8view scalar starts with 1.02 620.5±7.92µs ? ?/sec 1.00 609.3±5.52µs ? ?/sec
eq StringViewArray StringViewArray 1.00 4.3±0.02ms ? ?/sec 1.10 4.8±0.03ms ? ?/sec
eq StringViewArray StringViewArray inlined bytes 1.00 8.2±0.05ms ? ?/sec 1.57 12.8±0.10ms ? ?/sec
eq long same prefix strings StringViewArray 1.04 521.7±11.72µs ? ?/sec 1.00 500.1±12.99µs ? ?/sec
eq scalar StringViewArray 13 bytes 1.00 3.8±0.04ms ? ?/sec 1.01 3.9±0.04ms ? ?/sec
eq scalar StringViewArray 4 bytes 1.00 4.0±0.04ms ? ?/sec 1.02 4.1±0.03ms ? ?/sec
eq scalar StringViewArray 6 bytes 1.00 4.0±0.08ms ? ?/sec 1.00 4.0±0.03ms ? ?/sec
like_utf8view scalar complex 1.01 80.5±0.69ms ? ?/sec 1.00 80.0±0.40ms ? ?/sec
like_utf8view scalar contains 1.00 76.5±0.65ms ? ?/sec 1.00 76.7±0.63ms ? ?/sec
like_utf8view scalar ends with 13 bytes 1.00 18.8±0.21ms ? ?/sec 1.01 19.1±0.46ms ? ?/sec
like_utf8view scalar ends with 4 bytes 1.00 18.9±0.19ms ? ?/sec 1.00 18.9±0.27ms ? ?/sec
like_utf8view scalar ends with 6 bytes 1.01 18.9±0.23ms ? ?/sec 1.00 18.8±0.19ms ? ?/sec
like_utf8view scalar equals 1.02 14.3±0.62ms ? ?/sec 1.00 14.0±0.17ms ? ?/sec
like_utf8view scalar starts with 13 bytes 1.00 18.1±0.19ms ? ?/sec 1.01 18.2±0.12ms ? ?/sec
like_utf8view scalar starts with 4 bytes 1.00 10.5±0.10ms ? ?/sec 1.00 10.5±0.16ms ? ?/sec
like_utf8view scalar starts with 6 bytes 1.00 18.4±0.20ms ? ?/sec 1.00 18.4±0.15ms ? ?/sec
long same prefix strings like_utf8view scalar complex 1.01 624.2±32.90µs ? ?/sec 1.00 615.5±13.97µs ? ?/sec
long same prefix strings like_utf8view scalar contains 1.02 1849.8±73.20µs ? ?/sec 1.00 1813.0±19.86µs ? ?/sec
long same prefix strings like_utf8view scalar ends with 1.00 593.1±4.95µs ? ?/sec 1.01 598.3±13.30µs ? ?/sec
long same prefix strings like_utf8view scalar equals 1.03 195.4±5.64µs ? ?/sec 1.00 190.2±1.29µs ? ?/sec
long same prefix strings like_utf8view scalar starts with 1.00 671.4±5.58µs ? ?/sec 1.01 676.8±20.18µs ? ?/sec
lt StringViewArray StringViewArray inlined bytes 1.00 47.6±1.05ms ? ?/sec 1.11 52.8±1.38ms ? ?/sec
lt long same prefix strings StringViewArray 1.06 512.5±9.08µs ? ?/sec 1.00 482.1±5.10µs ? ?/sec
lt scalar StringViewArray 1.00 26.8±0.92ms ? ?/sec 1.04 28.0±1.56ms ? ?/sec
neq long same prefix strings StringViewArray 1.03 521.1±9.77µs ? ?/sec 1.00 503.7±9.93µs ? ?/sec Still have 57% and %11 for the inlined case: eq StringViewArray StringViewArray inlined bytes 1.00 8.2±0.05ms ? ?/sec 1.57 12.8±0.10ms ? ?/sec
lt StringViewArray StringViewArray inlined bytes 1.00 47.6±1.05ms ? ?/sec 1.11 52.8±1.38ms ? ?/sec |
arrow-ord/src/cmp.rs
Outdated
let r_len = *r_view as u32; | ||
// This is a fast path for equality check. | ||
// We don't need to look at the actual bytes to determine if they are equal. | ||
if l_len != r_len { | ||
return false; | ||
} | ||
|
||
// When both len are same, we can compare the inlined bytes, this can handle the corner case after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can also compare earlier based on actual length instead like in the other place?
i am also not sure comparing length here is really a fast path, because slice equality also checks on slice first before accessing the element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I think we can do the eq
case on the u128 value directly @zhuqi-lucas (compare both length and value to be equal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you @Dandandan , thank you for catching this! Addressed in latest PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change, eq optimize a lot, 4.75 faster.
critcmp main fast_path_view --filter "inlined bytes"
group fast_path_view main
----- -------------- ----
eq StringViewArray StringViewArray inlined bytes 1.00 2.7±0.01ms ? ?/sec 4.75 12.8±0.10ms ? ?/sec
lt StringViewArray StringViewArray inlined bytes 1.00 47.3±0.55ms ? ?/sec 1.12 52.8±1.38ms ? ?/sec
Similar changes for datafusion side, it shows good result for Q11 sort-tpch which is mostly using inlined bytes to compare. |
let r_view = unsafe { r.0.views().get_unchecked(r.1) }; | ||
let l_len = *l_view as u32 as usize; | ||
let r_len = *r_view as u32 as usize; | ||
let l_bytes = unsafe { GenericByteViewArray::<T>::inline_value(l_view, l_len) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps the main thing I see is what we could do next is do inline_value
on a u128 level rather than &[u8] and compare the values on u128 and length if equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Dandandan , interesting, i will investigate this idea!
All in all it's good, I think we can improve it further, I'll let you decide @zhuqi-lucas if you want to do it in this PR or later! I think we can also improve the non-inlined case with prefixes in the views, I will create an issue about it later. |
Thank you @Dandandan , let's create the issues for further improvement, i am willing to investigate and take the related issues. |
Which issue does this PR close?
Add fast path for empty buffer case to skip compute len, and use view(u128) to compare directly.
Closes #7621
Rationale for this change
Add fast path for empty buffer case to skip compute len, and use view(u128) to compare directly.
What changes are included in this PR?
Add fast path for empty buffer case to skip compute len, and use view(u128) to compare directly.
Are there any user-facing changes?
No
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.